30 research outputs found

    Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

    2021 Spring. Includes bibliographical references. Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high-throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, the introduction of a log-odds ratio for comparing naive Bayes posterior estimates, a description of schemas for parallel and distributed naive Bayes architectures, and the use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.

    West African Anopheles Gambiae Mosquitoes Harbor a Taxonomically Diverse Virome Including New Insect-Specific Flaviviruses, Mononegaviruses, and Totiviruses

    Anopheles gambiae are a major vector of malaria in sub-Saharan Africa. Viruses that naturally infect these mosquitoes may impact their physiology and ability to transmit pathogens. We therefore used metagenomic sequencing to search for viruses in adult Anopheles mosquitoes collected from Liberia, Senegal, and Burkina Faso. We identified a number of virus and virus-like sequences from mosquito midgut contents, including 14 coding-complete genome segments and 26 partial sequences. The coding-complete sequences define new viruses in the order Mononegavirales and the families Flaviviridae and Totiviridae. The identification of a flavivirus infecting Anopheles mosquitoes broadens our understanding of the evolution and host range of this virus family. This study increases our understanding of virus diversity in general, begins to define the virome of a medically important vector in its natural setting, and lays groundwork for future studies examining the potential impact of these viruses on Anopheles biology and disease transmission.

    Investigating Effects of Tulathromycin Metaphylaxis on the Fecal Resistome and Microbiome of Commercial Feedlot Cattle Early in the Feeding Period

    The objective was to examine effects of treating commercial beef feedlot cattle with therapeutic doses of tulathromycin, a macrolide antimicrobial drug, on changes in the fecal resistome and microbiome using shotgun metagenomic sequencing. Two pens of cattle were used, with all cattle in one pen receiving metaphylaxis treatment (800 mg subcutaneous tulathromycin) at arrival to the feedlot, and all cattle in the other pen remaining unexposed to parenteral antibiotics throughout the study period. Fecal samples were collected from 15 selected cattle in each group just prior to treatment (Day 1), and again 11 days later (Day 11). Shotgun sequencing was performed on isolated metagenomic DNA, and reads were aligned to a resistance and a taxonomic database to identify alignments to antimicrobial resistance (AMR) gene accessions and microbiome content. Overall, we identified AMR gene accessions encompassing 9 classes of AMR drugs and encoding 24 unique AMR mechanisms. Statistical analysis was used to identify differences in the resistome and microbiome between the untreated and treated groups at both timepoints, as well as over time. Based on composition and ordination analyses, the resistome and microbiome were not significantly different between the two groups on Day 1 or on Day 11. However, both the resistome and microbiome changed significantly between these two sampling dates. These results indicate that the transition into the feedlot (and associated changes in diet, geography, conspecific exposure, and environment) may exert a greater influence over the fecal resistome and microbiome of feedlot cattle than common metaphylactic antimicrobial drug treatment.

    Improvement to the Prediction of Fuel Cost Distributions Using ARIMA Model

    Availability of a validated, realistic fuel cost model is a prerequisite to the development and validation of new optimization methods and control tools. This paper uses an autoregressive integrated moving average (ARIMA) model with historical fuel cost data to develop a three-step-ahead fuel cost distribution prediction. First, the data features of Form EIA-923 are explored and the natural gas fuel costs of Texas generating facilities are used to develop and validate the forecasting algorithm for the Texas example. Furthermore, the spot price associated with the natural gas hub in Texas is utilized to enhance the fuel cost prediction. The forecasted data is fit to a normal distribution and the Kullback-Leibler divergence is employed to evaluate the difference between the real fuel cost distributions and the estimated distributions. The comparative evaluation suggests the proposed forecasting algorithm is effective in general and is worth pursuing further. Comment: Accepted by IEEE PES 2018 General Meeting

    Facial mimicry and emotion consistency: Influences of memory and context.

    This study investigates whether mimicry of facial emotions is a stable response or can instead be modulated and influenced by memory of the context in which the emotion was initially observed, and therefore the meaning of the expression. The study manipulated emotion consistency implicitly, where a face expressing smiles or frowns was irrelevant and to be ignored while participants categorised target scenes. Some face identities always expressed emotions consistent with the scene (e.g., smiling with a positive scene), whilst others were always inconsistent (e.g., frowning with a positive scene). During this implicit learning of face identity and emotion consistency there was evidence for encoding of face-scene emotion consistency, with slower RTs, a reduction in trust, and inhibited facial EMG for faces expressing incompatible emotions. However, in a later task where the faces were subsequently viewed expressing emotions with no additional context, there was no evidence for retrieval of prior emotion consistency, as mimicry of emotion was similar for consistent and inconsistent individuals. We conclude that facial mimicry can be influenced by current emotion context, but there is little evidence of learning, as subsequent mimicry of emotionally consistent and inconsistent faces is similar.

    Biophysically inspired rational design of structured chimeric substrates for DNAzyme cascade engineering.

    The development of large-scale molecular computational networks is a promising approach to implementing logical decision making at the nanoscale, analogous to cellular signaling and regulatory cascades. DNA strands with catalytic activity (DNAzymes) are one means of systematically constructing molecular computation networks with inherent signal amplification. Linking multiple DNAzymes into a computational circuit requires the design of substrate molecules that allow a signal to be passed from one DNAzyme to another through programmed biochemical interactions. In this paper, we chronicle an iterative design process guided by biophysical and kinetic constraints on the desired reaction pathways and use the resulting substrate design to implement heterogeneous DNAzyme signaling cascades. A key aspect of our design process is the use of secondary structure in the substrate molecule to sequester a downstream effector sequence prior to cleavage by an upstream DNAzyme. Our goal was to develop a concrete substrate molecule design to achieve efficient signal propagation with maximal activation and minimal leakage. We have previously employed the resulting design to develop high-performance DNAzyme-based signaling systems with applications in pathogen detection and autonomous theranostics.

    SCS Design 2.

    (a) Target structure. (b) MFE structure of the SCS Design 2 sequence from Table 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0110986#pone-0110986-t001). (c) Hypothesized mechanism for cleavage of Design 2, resulting in the release of an activator (Act) that can instigate a downstream TMSD reaction. (d) Response of Design 1 over 60 min. The shorter stem decreased SCS stability, resulting in increased leakage, and the rate of activation compared with leakage was negligible, likely due to the inefficiency of catalyzing hydrolysis of a cleavage site in a loop. The negative control is the downstream activity in the absence of both the SCS and the upstream DNAzyme.

    SCS Designs 3 and 4.

    (a) Target structures, which vary in the stem length. (b) MFE structures of the SCS Design 3 and 4 sequences from Table 1 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0110986#pone-0110986-t001). (c) Hypothesized mechanism for cleavage of Designs 3 and 4, illustrated with Design 3. (d) Response of Design 3 over 25 min. (e) Response of Design 4 over 30 min. For both designs, rapid activation was achieved. However, the rate of leakage also increased, indicating that the protection of the toehold was insufficient. This was likely due to the relatively short stem and large loop. The leakage was lower with Design 4 due to the longer stem.